A joint learning approach for genomic prediction in polyploid grasses

Poaceae, among the most abundant plant families, includes many economically important polyploid species, such as forage grasses and sugarcane (Saccharum spp.). These species have elevated genomic complexities and limited genetic resources, hindering the application of marker-assisted selection strategies. Currently, the most promising approach for increasing genetic gains in plant breeding is genomic selection. However, due to the polyploidy nature of these polyploid species, more accurate models for incorporating genomic selection into breeding schemes are needed. This study aims to develop a machine learning method by using a joint learning approach to predict complex traits from genotypic data. Biparental populations of sugarcane and two species of forage grasses (Urochloa decumbens, Megathyrsus maximus) were genotyped, and several quantitative traits were measured. High-quality markers were used to predict several traits in different cross-validation scenarios. By combining classification and regression strategies, we developed a predictive system with promising results. Compared with traditional genomic prediction methods, the proposed strategy achieved accuracy improvements exceeding 50%. Our results suggest that the developed methodology could be implemented in breeding programs, helping reduce breeding cycles and increase genetic gains.

Pop2 and Pop3 were generated and evaluated by the Embrapa Beef Cattle (Brazilian Agricultural Research Corporation), located in Campo Grande, Mato Grosso do Sul, Brazil (20 • 27 ′ S, 54 • 37 ′ W, 530 m). Pop2 originated from a cross between U. decumbens D62 (cv. Basilisk) as the male apomictic parent and U. decumbens D24/27 as the female sexual parent obtained by tetraploidization with colchicine 29 . Plants were evaluated for the following agronomic traits: regrowth capacity (RC), field green weight in kg/ha (FGW), total dry matter (TDM) in kg/ha, leaf dry matter (LDM) in kg/ha, leaf percentage (LP), and the leaf stem ratio (LSR), as described by Mateus et al. 30 . Briefly, the field experimental design followed a simple lattice of 18 x 18 meters with two replications, seven controls and five plants per plot. These traits were evaluated for seven harvests in 2012: two in the dry season and five in the rainy season.
Pop3 was obtained from a cross between M. maximus cv. Mombaça, the apomictic parent, and an obligate sexual parent, M. maximus S10. According to Deo et al. 31 , this population was evaluated in an augmented block design with 160 regular treatments distributed in eight blocks with two replicates. Each block was composed of 22 plots with 20 individuals and two controls. Genotypes were evaluated for several agronomic traits: green matter (GM), stem dry matter (SDM), percentage of leaf blade (PLB), TDM, LDM, and RC. Phenotyping was performed in six harvests (three in 2013 and three in 2014), with the exception of phenotyping of RC, which was performed for three harvests (one in 2013 and two in 2014).
Phenotypic data analyses. Raw phenotypic data were processed according to the mixed linear models described by Aono et al. 27 , Ferreira et al. 32 , and Deo et al. 31 for Pop1, Pop2 and Pop3 respectively. Trait heritabilities were calculated based on model variances considering H 2 = σ 2 g /σ 2 p , where σ 2 g and σ 2 p are the genetic and phenotypic variances, respectively. Pop1 data were evaluated in R statistical software 33 using breedR v.0.12 34 and bestNormalize 35 for data normalization. We modeled each trait considering the following statistical model: www.nature.com/scientificreports/ where each trait measure Y ijrkm of the rth replicate of the ith genotype was evaluated at the kth location of the jth block in the mth year. The trait mean is given by µ , and the contributions of the kth location ( L k ), the mth harvest ( H m ), the jth block at the kth location in the mth year ( B j(km) ), and the interaction between the kth location and mth harvest ( LH km ) were modeled as fixed effects. The effects of the genotype G i(km) and the residual error e ijrkm were included as random effects. Pop2 phenotypic evaluations were performed using ASReml-R v3 36,37 and ASRemlPlus 38 software. The traits were modeled with the following model: where each trait measure Y ijmr of the ith genotype was evaluated in the jth block in the mth harvest considering the rth replicate. The trait mean is given by µ , and the contributions of the rth replicate ( R r ) and the mth harvest ( H m ) were modeled as fixed effects. The contributions of the ith genotype ( G i ), the jth block in the rth replicate ( B j(r) ), and the interaction between the ith genotype and the mth harvest ( GH im ) were included as random effects, as well as the residual error e ijkm .
For Pop3, a longitudinal linear mixed model was fit using ASReml-R 36 , with Box-Cox transformation 39 for data normalization. The following model was employed: where each trait measure Y ijmr of the ith genotype was evaluated in the jth block in the mth harvest considering the rth replicate. The trait mean is given by µ , and the contributions of the mth harvest ( H m ) and the rth replicate in the mth harvest ( R r(m) ) were modeled as fixed effects. The contributions of the jth block in the mth harvest ( B j(m) ), the interaction of the rth replicate and jth block in the mth harvest ( RB rj(m) ), the ith genotype (or control) in the mth harvest ( G i(m) ) and the environmental error ( e ijmr ) were modeled as random effects.
The best linear unbiased predictors (BLUPs) (Pop1) and corrected phenotypes (Pop2 and Pop3) were estimated, rescaled to a new range of values between 0 and 1 using min-max normalization and used for phenotypic predictions. For a set of values v, the transformed values f(v) are: Visual inspections of phenotypic data distributions were performed using the ggplot2 package 40 in R. Trait correlations were estimated with R Pearson correlation coefficients by using the PerformanceAnalytics package 41 in R.
Genotyping procedures. As the genotyping approach, we employed a genotyping-by-sequencing (GBS) strategy. GBS libraries were prepared following the protocol of Elshire et al. 42 for Pop1 and Pop2 and the protocol of Poland et al. 43 for Pop3. DNA extraction from Pop1 was performed using a cetyltrimethylammonium bromide (CTAB)-based protocol 44 , and two 96-plex GBS libraries were constructed using PstI, with two samples per parent. Five sequencing runs were performed with the Illumina GAIIx and Illumina NextSeq systems 27 . Genomic DNA of Pop2 was extracted using the DNeasy Plant Kit (QIAGEN, Hilden, Germany). Two 96-plex GBS libraries were constructed using NsiI and five replicates per parent and sequenced on the NextSeq 500 platform (Illumina, San Diego, CA, USA) 32 . DNA extraction from Pop3 was performed following a CTAB-based method 45 with slight modifications. Then, 96-plex GBS libraries were constructed using a combination of a rarely cutting enzyme (PstI) and a frequently cutting enzyme (MspI), with 12 replicates for each parent. Pop3 libraries were also sequenced on the NextSeq 500 platform 31 .
SNP calling in SU was performed by coupling several bioinformatics pipelines and selecting the intersecting markers by the TASSEL-GBS v.4 pipeline 46 modified for polyploids 47 51 . GBS reads were filtered and preprocessed using FastX-Toolkit scripts 52 , and comparative alignments were performed against the PhiX genome using BLASTn 53 , as described by Aono et al. 27 . Read alignment was performed with BWA-MEM version 0.7.12 54 57 . In this step, the following settings were used for both populations: very-sensitive-location, a limit of 20 dynamic programming problems and a maximum of 4 times to align a read.
After selecting the SNP sets for each population, we kept only biallelic variants with a minimum depth of 50 reads per individual and a maximum of 25% missing data. These markers were represented as allele proportions by calculating the ratio between the number of reads for the reference allele and the total number of reads 27 . Missing data were imputed as means.

Multivariate analyses.
To evaluate genotypic and phenotypic data profiles across populations, we performed principal component analyses (PCAs), t-distributed stochastic neighbor embedding (t-SNE) analyses 58 , and redundancy analyses (RDAs) 59 . Scatter plots were constructed using the ggplot2 R package. We evaluated phenotypic data using PCA and genotypic data with t-SNE in the Rtsne R package 60 , considering a perplexity of 30. We combined both datasets with RDA, which was implemented in the vegan R package 61 . RDA constrained axes were evaluated with the analysis of variance (ANOVA) F-statistic, and to identify SNPs putatively under association with the phenotypes, we used a three-standard deviation cut-off (p-value of 0.0027). Additionally, we performed a set of clustering analyses for each trait, considering hierarchical (Ward's minimum variance with and without squared dissimilarities, herein referred to as WardWSD and WardWoSD, respectively), single, complete, unweighted pair-group method with arithmetic averaging (UPGMA), weighted pair-group method with arithmetic averaging (WPGMA), weighted pair-group method with centroid averaging (WPGMC), unweighted pair group method with centroid averaging (UPGMC) and nonhierarchical (K-means) strategies together with pairwise Euclidean distances. For each of the clustering methods and traits, we calculated the clustering index by using the NbClust R package 62 , defining the best clustering scheme as that most indicated by the indexes in a cluster number range of 2 to 10. Cluster visualization was performed with the ggtree R package 63 and used to create intervals for each phenotype. The best method was defined through visual inspections considering the trait distribution and the cluster configurations on the heatmaps created.
Single regression genomic prediction. We evaluated the prediction accuracies of traditional models for GP and ML-based approaches. As traditional approaches, we selected (a) Bayesian ridge regression (BRR) 64 ; (b) Bayesian reproducing kernel Hilbert spaces regression with kernel averaging (RKHS-KA) 65 , considering different kernels with bandwidth parameters of 15M, 5M, and 1M, where M represents the median squared Euclidean distance across genotypes; and (c) a single-environment, main genotypic effect model with a Gaussian kernel (SM-GK) 66 . BRR and RKHS-KA models were implemented in the BGLR R package 67 , and the SM-GK model was implemented in the BGGE R package 66 . Given a genotype matrix (SNP data), codified using allele proportions, with p loci and n individuals, the BRR models were estimated considering the following equation: where y represents the response measures, µ the overall population mean, e the residuals ( e ∼ MVN(0 n , Iσ 2 e ) ), Z the genotype matrix, and γ the vector of SNP effects with the same normal prior distribution for all loci ( γ i |σ 2 γ ∼ N(0, σ 2 γ ) ) and the hyperparameter following a scaled inverse chi-squared hyperprior distribu- f. the degrees of freedom and S γ the scale parameter). In SM-GK, considering the same equation, Z is the incidence matrix for random genetic effects, and γ the genetic effects considering γ j |σ 2 γ ∼ N(0, σ 2 γ K) , where the matrix K is computed through a Gaussian kernel (GK) function. For pairwise K calculations between two genotypes x i and x i ′ , where h is the bandwidth parameter (herein considered 1 for SM-GK). In RKHS-KA, a multikernel model is fitted with the same GK but considering separate random effects for three established kernels. The posterior distribution of the models was assessed with the Gibbs sampler, using 20,000 interactions and discarding 2000 cycles (burn-in).
For ML-based approaches, we selected regression methods based on the following algorithms: (i) k-nearest neighbors (KNN) 68 , (ii) support vector machine (SVM) 69 , (iii) random forest (RF) 70 , and (iv) AdaBoost 71 . All of these algorithms were implemented with the scikit-learn Python v.3 module 72 . For KNN ( k = 5 ), pairwise Euclidean distances with uniform weights were considered. In SVM regression, a radial basis function kernel ( K SVM ) was used. Considering two genotypes x i and where p represents the number of loci and σ 2 Z the variance of the genotype matrix Z . The RF meta-estimator was built considering 100 decision trees for random data splits evaluated through mean squared errors (MSEs) calculated as MSE = 1 n n i=1 (y i −ŷ i ) 2 , where y i and ŷ i are the real and the predicted values, respectively, across n individuals in data split k (from 1 to 100). Tree induction was performed considering a maximum tree height of p (the number of loci). For AdaBoost, the base estimator was a decision tree regressor with a linear loss function for weighting boosting interactions.
All these models were evaluated under a cross-validation (CV) scenario with 50 10-fold repeats and the leaveone-out (LOO) approach. For each trait, predictive accuracies were estimated as MSEs and R Pearson correlations and visualized with the ggplot2 R package. The models' performances were compared with ANOVA followed by Tukey's multiple comparisons test (p-value of 0.05), implemented in the agricolae R package 73 . Phenotypic interval prediction. In addition to single regression genomic prediction, we evaluated the performance of classification models for predicting phenotypic intervals. The predictive capabilities were estimated using 10-fold CV repeated 50 times with stratification based on the intervals. The following ML algorithms were used: (i) SVM 69 , (ii) RF 70 , (iii) multilayer perceptron (MLP) neural network 74 , and (iv) Gaussian naive Bayes (GNB) 75 . All these models were estimated using the scikit-learn Python v.3 module 72 with the default parameters to provide a baseline score.
Feature selection. In this work, we used feature selection (FS) techniques to extract loci with putative phenotypic associations considering higher predictive accuracies. For this task, we tested the performance of the ML-based regression and classification models described in previous sections for predicting the phenotypes and their interval values considering scenarios with and without (the whole marker dataset) FS in a 10-fold CV repeated 50 times.
For classification FS, we employed the approach described by Aono et al. 27 . As an FS system, we combined the results of (i) L1-based FS with a linear support vector classification system 69 , (ii) ANOVA-based univariate FS (p-value of 0.05), and (iii) gradient tree boosting (GTB) 76 . In this approach, a locus is retrieved by the system if www.nature.com/scientificreports/ identified as relevant for classification in at least two out of the three FS techniques employed. This methodology was extended for regression FS, considering models adapted for a continuous response variable. Techniques (i) and (iii) were changed, considering SVM and GTB for regression, and approach (ii) was replaced with Pearson correlation (p-value cut-off of 0.05).

Marker annotation.
For each phenotype, we retrieved the selected markers and obtained a genomic window of 1000 base pairs upstream and downstream in the reference genome. With these sequences, we performed comparative alignments against coding DNA sequences (CDSs) of fourteen different species from the Poaceae family (Brachypodium distachyon, B. hybridum, B. silvatium, Hordeum vulgare, Oryza sativa, Oropetium thomaeum, Panicum hallii, P. virgatum, Sorghum bicolor, Setaria italica, S. viridis, Triticum aestivum, Thinopyrum intermedium and Zea mays) and Arabidopsis thaliana extracted from the Phytozome v.13 database 57 . All the correspondences were used to summarize related Gene Ontology (GO) terms using the Revigo tool 77  where y represents the response measures, µ the overall population mean, X the genotype matrix with the FSassociated markers, Z the incidence matrix for random genetic effects (estimated without the inclusion of X), and e the residuals. The marker effects in the FS set are represented by α , and the genetic effects of γ follow the distribution described above.
To compare predictive performance between the developed models and the other techniques, we employed the same CV scenarios described above (LOO and 10-fold), retrieving the MSE measures and the Pearson correlation coefficients. Correlation plots were constructed with the ggplot2 R package.

Simulation approaches.
To evaluate the proposed approach in larger populations in a controlled scenario, we performed simulations using AlphaSimR software 81 . We selected an entire wheat breeding program 82,83 for creating genotypic (21 chromosomes with 1000 SNPs each and coded as 0 (ancestral homozygote), 1 (heterozygote) and 2 (derived homozygote)) and phenotypic data (yield). For simulation scenarios, we considered different quantities of QTLs per chromosome (2, 10, 100 and 1000) with additive effects on yield sampled from a normal distribution (mean of 4 and variance of 0.1). Using such a configuration, 70 inbred lines were created and used for 20 years of phenotypic selection, with the first 10 years discarded as burn-in. The environmental variance was set to 0.4 and the genotype-environment interaction variance was set to 0.2.
The inbred lines were used to create 100 biparental populations with 100 individuals each, for a total of 10,000 genotypes. These were selected using (i) a preliminary yield trial (PYT), (ii) an advanced yield trial (AYT), and (iii) an elite yield trial (EYT). To advance a genotype to the PYT, we created double haploids of the 10,000 lines, visually evaluated in a head row nursery (heritability of 0.1) to select the best 1000 genotypes (10 per family). The best 100 lines in the PYT were selected for the AYT considering an unreplicated trial (heritability of 0.2), and the best 10 in the AYT were advanced to the EYT using a small multilocation replicated trial (heritability of 0.5). From the 10 lines in the EYT, the released variety was selected using a large multilocation replicated trial in 2 consecutive years (heritability of 0.67).
From the last 10 years of phenotypic selection, we selected the genotypic data of the PYT with the corresponding phenotypes simulated to create datasets for evaluating the proposed approach. From a total of 10,000 lines, we sampled different numbers of individuals (100, 250, 500, 1000, 2000, 5000 and 10,000) and performed 10-fold CV evaluations for GP.

Results
Phenotypic and genotypic data analyses. From the defined mixed models, we could estimate, for each trait and population, the corresponding BLUPs and corrected phenotypes (Supplementary Figs. S1-S3), which were used for GP. Although there were some evident correlations between traits (Supplementary Figs. S4-S6), most of the evaluated phenotypes clearly represented a different prediction task for the development of models. There was a variable range of heritabilities associated with each trait, ranging from 0.49 to 0.51 in Pop1, from 0.52 to 0.81 in Pop2, and from 0.19 to 0.64 in Pop3 (Table 1).
A total of 182 (out of 219), 219 (out of 239) and 138 (out of 138) individuals were genotyped for Pop1, Pop2, and Pop3, respectively. For each population, we generated the following number of filtered biallelic SNP markers: (i) 14,540 in Pop1 (from a total of 137,757 raw SNPs), (ii) 4548 in Pop2 (from 9354), and (iii) 5137 in Pop3 For each of the traits evaluated, we used nine different clustering methods and considered the group configuration as that most indicated by clustering indexes calculated with the NbClust package (with the cluster number ranging between 2 and 10). All nine clustering schemes identified for each trait were compared through visual inspections using dendrograms and heatmaps ( Supplementary Figs. S7-S9). The most suitable group configuration for separating the phenotypic values was assigned for each trait, resulting in different cluster quantities (Table 1). In Pop1, the WardWoSD method was the best clustering strategy. For Pop2, the WardWoSD method was also most appropriate for most of the traits, except for LSR and TDM, which were evaluated using hierarchical complete clustering and the WardWSD approach, respectively. For Pop3, all the traits were evaluated with the WardWSD approach.
To provide an overview of the genotypic and phenotypic data, we performed different multivariate analyses. For FG populations, contaminating individuals might be introduced during crosses, which could be detected by such analyses 84 . As expected, we found putative contaminants in Pop2 though t-SNE analysis performed on genotypic data ( Supplementary Fig. S10). The individuals with distinct profiles across the Cartesian plane were excluded from the dataset and not considered in further analyses (18 genotypes located apart from most of the population in the t-SNE biplot were discarded).
With the final phenotypic datasets, we performed PCAs (Fig. 1), which were used to color and size the t-SNE results from genotypic data. These plots demonstrate the lack of correspondence between the genotypic and phenotypic datasets, corroborating the difficulty of developing predictive models for such associations in the populations under analysis. By considering linear dependency across SNPs and traits, RDAs were performed, allowing the identification of a small number of putative linear associations (233 in Pop1, 37 in Pop2, and 21 in Pop3), evidencing the need for alternative approaches for categorizing the genomic regions linked to the phenotypes analyzed.
Single regression genomic prediction. The predictive performance for each trait was evaluated with seven different approaches for the regression task: BRR, RKHS-KA, SM-GK, AdaBoost, KNN, RF, and SVM. By employing 10-fold CV 50 times, we observed superior results for BRR, RKHS-KA, SM-GK and SVM (Supplementary Figs. S11-S13, Supplementary Table S1). Although not all four methods ranked best in terms of predictive performance for SH (Pop1), RC (Pop2), and PLB (Pop3), there were no pronounced differences between their accuracies and those of the highest ranking models for these traits.
For BRR, RKHS-KA, SM-GK and SVM, we also performed LOO evaluations ( Table 2). The predictive performances were small even in a CV scenario with a larger training dataset. The highest observed R values were 0.38, 0.32 and 0.21 for SH in Pop1 (SM-GK and SVM), LDM in Pop2 (SVM), and SDM in Pop3 (SVM), respectively. Additionally, we did not observe clearly superior performance for any model, which would have allowed us to select a specific approach for fitting the regression model. For each trait, one of the four selected methods presented a slightly superior accuracy compared to that of the other three, as already observed in the 10-fold CV evaluations (Supplementary Figs. S11-S13). Even with the best accuracy values observed, SH in Pop1, LDM in Pop2 and SDM in Pop3 could not be efficiently predicted, once again demonstrating the need for the development of suitable regression methods for complex polyploid grasses. www.nature.com/scientificreports/ Feature selection. To improve predictive accuracies, we combined different FS methods for both classification (C) and regression (R) techniques. The regression problem was configured considering trait value prediction, and classification, considering the prediction of the clusters defined for each trait. To supply a more restricted group of markers, we used two different approaches to marker selection: the intersection between the three FS techniques used in classification (C3) or regression (R3) and the intersection of at least two out of the three FS methods (C2 and R2). Each approach generated a different quantity of markers when considering the regression and classification problems separately (Supplementary Table S2). www.nature.com/scientificreports/ Regression-based FS generated larger quantities of markers, ranging from 898 to 980 in Pop1, from 242 to 367 in Pop2, and from 266 to 373 in Pop3. Classification-based FS, on the other hand, enabled the identification of more restricted marker subsets (from 208 to 211 in Pop1, from 86 to 117 in Pop2, and from 58 to 133 in Pop3). Interestingly, approximately half of the markers found for classification were also found for regression, showing the compatibility of these approaches. As expected, such FS methods enabled higher accuracies for ML prediction and the categorization of phenotype-associated genomic regions with diverse biological functions (Supplementary Figs. S14-S41).
Even though the ML models for predicting the different traits differed in suitability, more pronounced differences were observed when changing the marker dataset and not the ML algorithm. For some trait-model combinations, the more effective strategy was employing C2/R2, and for others, C3/R3. Regarding the biological process GO terms associated with the genes found in these marker regions, there was a clear core set of common interactions for each trait when analyzing the regression-and classification-based approaches, with differences regarding the inclusion of novel connections and categories. Joint learning. For each trait, we compared the predictive performances of the BRR, RKHS-KA and SM-GK models using the entire set of markers with the performances of SM-GK coupled with the FS approaches considering classification (C) and regression (R) algorithms. In addition to testing the intersection of the three FS techniques tested for both C and R (C3/R3) and the intersection of at least two techniques out of the three (C2/ R2), we also evaluated (i) the union of C2 and R2 (CR2), (ii) the union of C3 and R3 (CR3), and (iii) the intersection of C2 and R2 (ICR2). Finally, we tested the addition of C3 and R3 as fixed effects (C3F and R3F) with the maintenance of the remaining markers for estimating the relationship matrix K. All these models were evaluated using R Pearson correlation coefficients together with MSE measures in the 10-fold CV scenario established; these measures were contrasted with Tukey's multiple comparisons test (Supplementary Figs. S42-S55).
The correlation values calculated when subsetting the marker datasets clearly increased due to decreases in MSE values. The addition of markers as fixed effects, on the other hand, increased the R Pearson coefficients together with the MSEs; this indicates that such an increase in the correlation is derived from higher variability in the dataset rather than from better predictive performance. By selecting the best results according to Tukey's test for both evaluated metrics (Supplementary Table S3), we could establish the best models as SM-GK/R2 and SM-GK/CR2 for Pop1, SM-GK/R2, SM-GK/R3, and SM-GK/CR3 for Pop2, and SM-GK/R2, SM-GK/R3, SM-GK/CR2, and SM-GK/CR3 for Pop3.
Interestingly, we did not observe differences when applying SM-GK with R2 or CR2 and with R3 or CR3. In fact, although presenting intersections with R2 and R3 markers, C2 and C3 markers by themselves did not perform well in regression model estimation (Supplementary Figs. S42-S55). To further analyze these results, we employed LOO CV with the SM-GK regression model and the GNB classification algorithm for marker datasets R2, R3, CR2 and CR3 (Supplementary Table S4). We selected the GNB model due to the promising results observed in previous analyses (Supplementary Figs. S14-S41).
Similar to the results observed for the regression techniques, the classification algorithms performed better with the inclusion of markers selected for classification. However, as previously pointed out, the use of both  www.nature.com/scientificreports/ strategies combined (CR2 and CR3) produced satisfactory results for both classification and regression, supplying a powerful set of markers to be considered for prediction (Fig. 2). As the final approach suggested for constructing these GP models for complex polyploids, we highlight the approach of subsetting the marker dataset using the combined FS technique CR3, mainly because of (i) the high predictive accuracies observed for all traits, (ii) its ability to accommodate both classification and regression models, and (iii) the small number of markers compared to that in the CR2 dataset (for all traits, the number of markers selected for CR3 was approximately 10% of the number of markers selected for CR2). Although in some scenarios the prediction accuracies were higher for CR2, even when using CR3, they were significantly better than those observed with the traditional approaches.
Finally, by combining such an FS system based on the joint use of classification and regression ML strategies (CR3) with a popular statistical model for GP (SM-GK), we were able to suggest an ensemble method for predicting quantitative phenotypes in highly complex polyploid species. The increases in accuracy are remarkable (Fig. 3), and such a strategy also enables the construction of classification models for assisting in breeding selection strategies.

Simulation approaches.
To evaluate the feasibility of using the proposed approach for wider datasets, we performed simulations of a wheat breeding program and selected lines from the PYT for use in CV strategies. Using ten years of phenotypic selection, we were able to create 4 different datasets of 10,000 rows corresponding to genotypic data created based on 2, 10, 100 and 1,000 quantitative trait loci (QTLs). From these sets, different quantities of rows were sampled for independent evaluations (100, 250, 500, 1000, 2000, 5000, and 10,000). We contrasted the performance of the SM-GK using the entire marker dataset with the performance considering marker selection with CR3 (Fig. 4). For all the configurations tested, the prediction accuracies of SM-GK/CR3 were better than those of SM-GK, with differences in the percentage of increase depending on the configuration established.
Even though we observed slight differences in model performance when using different QTL quantities for the creation of genotypic datasets, the most pronounced increases in performance for SM-GK/CR3 compared with SM-GK were based on the quantity of individuals sampled from the data simulated and used for CV. In the smallest datasets (100 individuals), the accuracy for SM-GK/CR3 was approximately three times that of SM-GK, with a marker reduction exceeding 99%. The same increases in accuracy and decreases in marker data were also observed for 250 individuals. For 500 individuals, the increase was approximately 2 times, and the marker reduction was 98%. For more than 1000 individuals, the reductions in the marker dataset were also large (approximately 95%, 95%, 94%, and 90% for 1000, 2000, 5000, and 10,000 individuals, respectively); however, the increases in accuracy were more modest, being approximately 50%, 20%, 10%, and 5% for 1000, 2000, 5000, and 10,000 individuals, respectively.

Discussion
SU and FGs have considerable global economic importance, and their production demand is constantly rising 1,2 . In the current scenario of climate change and food insecurity, the development of novel and more productive cultivars by breeding programs represents the most effective and sustainable strategy for increasing production 85,86 . Although genomic approaches have revolutionized plant breeding with large associated genetic gains 87 , the inclusion of such techniques for the species employed in this work is still in its infancy 88,89 . The genomic complexity of these crops has hindered the development and acquisition of breeding genomic resources for many years 7,8 . In such a scenario, different molecular strategies have been tested for reducing such genomic complexity through methylation-sensitivity enzymes and facilitating the genotyping process while reducing associated costs, including GBS 43,90 . In this work, different populations genotyped with the GBS strategy were used to evaluate the estimation performance of GP models.
Even though GBS has been suitable for several studies of QTL mapping in these species 31,32,91 , developed GP strategies have presented low prediction accuracies 20,22,92,93 . Therefore, for practical incorporation of such techniques into breeding programs through GS, improved GP strategies are required, especially for most traits of interest, which are quantitative and highly polygenic and show phenotypic effects spread across unknown QTLs organized in a very complex genetic architecture 94 . Motivated by these facts, we suggested the incorporation of FS into GP as a means of circumventing the difficulty of estimating genotype and phenotype interactions and improving the low accuracies observed.
The traditional approaches employed for GP are based on the creation of regression models 25 , which estimate marker or genotype effects for the phenotype of interest, providing a means of predicting plant performances in the field 25 . In recent years, many studies have evaluated different types of statistical paradigms for such model estimation; however, most of the results show little/no changes in accuracy [95][96][97] . In contrast to the model efficiency observed for species with well-established genomic resources 98,99 , these strategies do not work properly in our scenario, inaccurately estimating the genetic effects of a specific trait and causing a significant loss of accuracy. We hypothesize that this deficiency is mainly caused by the difficulty of properly genotyping the individuals, which is due to several factors, including a lack of appropriate genomic references, the frequent occurrence of duplicated regions 7,13 , aneuploidy at different loci 100 , the difficulty of estimating correct allele dosages 101 , and the large amount of missing data 102 . Our first attempt to circumvent some of these limitations involved employing allele proportions instead of dosages, expanding the marker dataset and removing statistical assumptions regarding dosage attributions 27 .
The inclusion of ML approaches in GP has been controversial, being described as beneficial 97,[103][104][105] or not advantageous 18,106,107 . In the first scenarios that we evaluated in our work, we did not observe large increases when incorporating ML into prediction. Most of the ML algorithms performed worse than the traditional modeling www.nature.com/scientificreports/ www.nature.com/scientificreports/ approaches. Although we observed reasonable accuracies for some specific k-fold configurations (Supplementary Figs. S11-S13), this was not uniform across the CV scenarios established, which corroborates the strong impact of genotyping results on model creation. Beneficial k-fold splits for prediction represent similar training and test datasets (i.e., individuals with similar polymorphic profiles), and in such cases, GP models work accurately. Therefore, we consider that, in addition to our hypothesis that the marker datasets employed were not accurate enough for creating such models, the lack of satisfactory accuracies might be related to an absence of markers linked to several trait-associated QTLs, probably due to the GBS protocol and bioinformatics procedures employed. To evaluate these factors, we decided to check whether the exclusion of nonbeneficial markers for prediction through FS could improve model performance based on (i) possible incorrect marker estimation, (ii) the principle that not all SNPs influence the phenotype 108 , and (iii) our observation that the models were not able to automatically separate the markers most important for prediction from the rest of the SNPs. This strategy of FS has already demonstrated promising results across the literature 27,[109][110][111][112] . Subsetting datasets through principles of FS is a common strategy in data science for datasets with a large number of variables 113 . Considering the principle underlying GS that all markers can be used in model construction with different effects estimated for the traits of interest 114 , FS methods have not been required in the construction of these predictive models. However, this is not the scenario for several complex datasets, including those used in this work. An FS system in ML aims at removing redundant, incorrect and irrelevant data from the dataset prior to prediction 115 , and for a GP problem, this strategy supplies a means of selecting the most promising set of markers for prediction, which might also be indicative of QTL associations 116 .
The reduction of markers through the different FS strategies employed was effective, dramatically reducing the datasets and increasing the prediction accuracies. For the final strategy considered, the intersection of the three FS methods selected enabled the establishment of a very restricted dataset with insights into biological functions potentially associated with different trait configurations. Interestingly, all of the traits were associated with genes having biological functions related to regulatory mechanisms (e.g., regulation of transcription, translation, and DNA duplication), which is in accordance with the control mechanism impacting protein-DNA interactions expected for QTLs 117 . Additionally, important categories could be retrieved for the evaluated agronomic traits, such as carbohydrate metabolism, defense response, cellulose biosynthetic process, and response to stress. All of these GO terms have been previously associated with important biological processes regulating plant growth performance 31,118,119 . Although a detailed analysis should be performed to provide in-depth inferences about these mechanisms underlying the trait configurations, this preliminary investigation highlights the potential of the methodology to reveal such important biological features associated with target traits for breeding programs.
Considering that the final purpose of breeding programs is selecting the most promising genotypes for advancing breeding cycles and producing novel cultivars, we also considered for each individual a related phenotypic group, supplying an indicator of its overall performance. Through the initial phenotypic clustering analyses performed ( Supplementary Figs. S7-S9), it was possible to observe distinct groups of individuals with similar phenotypic values, which were considered classes in an ML-based classification scenario. Such interval prediction can be considered a complementary strategy for selecting the best potential candidates for a specific trait because of its capability of excluding a large number of candidates that have fewer chances to be among the top genotypes in terms of performance. Such interval prediction follows the same principle of ranking approaches, which have already been demonstrated as a potential tool in selective breeding 120 . Different from traditional GP models based on regression, this classification task can take advantage of a vast set of ML algorithms. Together with the marker dataset selected through FS methods applied to the quantitative traits, we coupled an FS system based on classification, selecting the markers related to these defined intervals and forming a joint strategy for prediction. Therefore, in addition to creating a prediction model for GP of quantitative traits, the same marker dataset can be applied to an independent system estimated for interval prediction.
Regarding the performance comparison of the best models (BRR, RKHS-KA, SM-GK, and SVM), we did not observe significant differences between the accuracy results. For this reason, we selected only one of them, SM-GK, for combination with the FS system. In addition to clear improvements in accuracy by such SNP subsetting, we also evaluated whether such selection could serve as a means of defining fixed effects in GP models. Although we also observed increases in the measurements of accuracy through Pearson correlation coefficients, the same beneficial aspect was not observed when evaluating the models' performance through MSEs. Instead, we observed higher predictive variability with stronger correlations and larger associated errors. Such a conclusion was previously reached when using peaks from a genome-wide association study (GWAS) as fixed effects in GP 121 . For this reason, we considered the subsetting strategy as the most promising method for boosting accuracies in our scenarios.
On the basis of the improvements observed (Fig. 3), we concluded that such an FS system was essential for showing the inefficiency of the models tested in terms of capturing real marker effects. Including a larger number of SNPs in GP may introduce background noise 110 , and in our scenario, such noise is expected to be found due to the elevated genomic complexity of these species. Interestingly, the GP models for SU presented better results than those for FGs. For SU SNP data, an SU pseudoreference genome was used for SNP calling 55 , different from the approach applied to FGs, where SNPs were estimated using genomic references from closely related species. Using Sorghum bicolor as a genomic reference for SU, a close relative, also decreases the quantity of reliable markers 27,112 and, consecutively, the chances of finding SNPs surrounding QTL regions. For this reason, we believe that the increases in accuracy in FG results were more modest than those in SU.
As the last evaluations performed, we created simulated populations using a wheat breeding program to check the feasibility of the approach in scenarios with a broader set of individuals and not only single biparental populations. One of the main limitations that we found was related to the number of individuals used for performing k-fold validation. Increases in the predictive accuracies were modest when very large populations were analysed, but even in these cases our approach still represents an important strategy to reduce marker datasets www.nature.com/scientificreports/ needed for prediction. This might contribute with cutting down genotyping costs-a constraint that remains highly relevant in breeding programs of the species under study. Increasing the number of observations in the dataset and using the same quantity of folds for CV introduce a larger number of samples to be predicted, which may correspond to a more difficult prediction task if the individuals in the training and test sets are more distantly genetically related 122 . This fact corroborates the need to develop appropriate populations for the proper application of GS models 123 . Additionally, using a small number of individuals leads to a restricted set of markers linked to the phenotypic variation being examined, which is not expected for a larger number of genotypes 124 . As expected, the methodology worked efficiently for a broad number of QTLs, showing its appropriateness for different quantitative traits, as already demonstrated in this study.
In the present study, we provided a method for GP using different datasets of polyploid grasses. Combining an FS engineering system capable of retrieving markers important for classification and regression, our methodology also shows great potential for investigating marker-trait associations. Additionally, our results highlight the benefit of incorporating methodologies for marker selection into prediction, which may also be seen as a promising approach for developing targeted sequencing methodologies that can be applied to create models for GS. With a small number of markers, we could achieve high associated accuracies for quantitative traits and for predicting putative performance intervals through classification, which can be seen as a complementary breeding tool. This strategy has the potential to aid in the development of models for species with elevated genomic complexity, surpassing the limitations of available genomic resources and supplying a means of incorporating GS into their breeding programs.

Data availability
Accession codes of sequencing data are available through the Sequence Read Archive (SRA) database with the accession numbers PRJNA478025, PRJNA472576 and PRJNA563938. www.nature.com/scientificreports/